Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Locality-sensitive hashing for the edit distance

Identifieur interne : 000500 ( Main/Exploration ); précédent : 000499; suivant : 000501

Locality-sensitive hashing for the edit distance

Auteurs : Guillaume Marçais [États-Unis] ; Dan Deblasio [États-Unis] ; Prashant Pandey [États-Unis] ; Carl Kingsford [États-Unis]

Source :

RBID : PMC:6612865

Abstract

AbstractMotivation

Sequence alignment is a central operation in bioinformatics pipeline and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy.

Results

We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH.

Availability and implementation

The code to generate the results is available at http://github.com/Kingsford-Group/omhismb2019.

Supplementary information

Supplementary data are available at Bioinformatics online.


Url:
DOI: 10.1093/bioinformatics/btz354
PubMed: 31510667
PubMed Central: 6612865


Affiliations:


Links toward previous steps (curation, corpus...)


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Locality-sensitive hashing for the edit distance</title>
<author>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation wicri:level="4">
<nlm:aff id="btz354-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author>
<name sortKey="Deblasio, Dan" sort="Deblasio, Dan" uniqKey="Deblasio D" first="Dan" last="Deblasio">Dan Deblasio</name>
<affiliation wicri:level="4">
<nlm:aff id="btz354-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author>
<name sortKey="Pandey, Prashant" sort="Pandey, Prashant" uniqKey="Pandey P" first="Prashant" last="Pandey">Prashant Pandey</name>
<affiliation wicri:level="4">
<nlm:aff id="btz354-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="4">
<nlm:aff id="btz354-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">31510667</idno>
<idno type="pmc">6612865</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6612865</idno>
<idno type="RBID">PMC:6612865</idno>
<idno type="doi">10.1093/bioinformatics/btz354</idno>
<date when="2019">2019</date>
<idno type="wicri:Area/Pmc/Corpus">000B23</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000B23</idno>
<idno type="wicri:Area/Pmc/Curation">000B23</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000B23</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000258</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Checkpoint">000258</idno>
<idno type="wicri:source">PubMed</idno>
<idno type="RBID">pubmed:31510667</idno>
<idno type="wicri:Area/PubMed/Corpus">000422</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000422</idno>
<idno type="wicri:Area/PubMed/Curation">000422</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">000422</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000495</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">000495</idno>
<idno type="wicri:Area/Ncbi/Merge">002339</idno>
<idno type="wicri:Area/Ncbi/Curation">002339</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">002339</idno>
<idno type="wicri:doubleKey">1367-4803:2019:Marcais G:locality:sensitive:hashing</idno>
<idno type="wicri:Area/Main/Merge">000503</idno>
<idno type="wicri:Area/Main/Curation">000500</idno>
<idno type="wicri:Area/Main/Exploration">000500</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Locality-sensitive hashing for the edit distance</title>
<author>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation wicri:level="4">
<nlm:aff id="btz354-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author>
<name sortKey="Deblasio, Dan" sort="Deblasio, Dan" uniqKey="Deblasio D" first="Dan" last="Deblasio">Dan Deblasio</name>
<affiliation wicri:level="4">
<nlm:aff id="btz354-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author>
<name sortKey="Pandey, Prashant" sort="Pandey, Prashant" uniqKey="Pandey P" first="Prashant" last="Pandey">Prashant Pandey</name>
<affiliation wicri:level="4">
<nlm:aff id="btz354-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="4">
<nlm:aff id="btz354-aff1">Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2019">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<title>Abstract</title>
<sec id="s1">
<title>Motivation</title>
<p>Sequence alignment is a central operation in bioinformatics pipeline and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of
<italic>k</italic>
-mers and do not take into account the relative ordering of
<italic>k</italic>
-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy.</p>
</sec>
<sec id="s2">
<title>Results</title>
<p>We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the
<italic>k</italic>
-mer contents of the sequences but also to the relative order of the
<italic>k</italic>
-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH.</p>
</sec>
<sec id="s3">
<title>Availability and implementation</title>
<p>The code to generate the results is available at
<ext-link ext-link-type="uri" xlink:href="http://github.com/Kingsford-Group/omhismb2019">http://github.com/Kingsford-Group/omhismb2019</ext-link>
.</p>
</sec>
<sec id="s4">
<title>Supplementary information</title>
<p>
<xref ref-type="supplementary-material" rid="sup1">Supplementary data</xref>
are available at
<italic>Bioinformatics</italic>
online.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Aldous, D" uniqKey="Aldous D">D. Aldous</name>
</author>
<author>
<name sortKey="Diaconis, P" uniqKey="Diaconis P">P. Diaconis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alonso, O" uniqKey="Alonso O">O. Alonso</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Backurs, A" uniqKey="Backurs A">A. Backurs</name>
</author>
<author>
<name sortKey="Indyk, P" uniqKey="Indyk P">P. Indyk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bar Yossef, Z" uniqKey="Bar Yossef Z">Z. Bar-Yossef</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Berlin, K" uniqKey="Berlin K">K. Berlin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Broder, A Z" uniqKey="Broder A">A.Z. Broder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chum, O" uniqKey="Chum O">O. Chum</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Drew, J" uniqKey="Drew J">J. Drew</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fredman, M L" uniqKey="Fredman M">M.L. Fredman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gollapudi, S" uniqKey="Gollapudi S">S. Gollapudi</name>
</author>
<author>
<name sortKey="Panigrahy, R" uniqKey="Panigrahy R">R. Panigrahy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Harris, R S" uniqKey="Harris R">R.S. Harris</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hunt, J W" uniqKey="Hunt J">J.W. Hunt</name>
</author>
<author>
<name sortKey="Szymanski, T G" uniqKey="Szymanski T">T.G. Szymanski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Indyk, P" uniqKey="Indyk P">P. Indyk</name>
</author>
<author>
<name sortKey="Motwani, R" uniqKey="Motwani R">R. Motwani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jaffe, D B" uniqKey="Jaffe D">D.B. Jaffe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jain, C" uniqKey="Jain C">C. Jain</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kushilevitz, E" uniqKey="Kushilevitz E">E. Kushilevitz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B. Langmead</name>
</author>
<author>
<name sortKey="Salzberg, S L" uniqKey="Salzberg S">S.L. Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lee, H" uniqKey="Lee H">H. Lee</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Levenshtein, V I" uniqKey="Levenshtein V">V.I. Levenshtein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H. Li</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R. Durbin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, C M" uniqKey="Liu C">C.-M. Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Luo, C" uniqKey="Luo C">C. Luo</name>
</author>
<author>
<name sortKey="Shrivastava, A" uniqKey="Shrivastava A">A. Shrivastava</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G. Marçais</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, E W" uniqKey="Myers E">E.W. Myers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ondov, B D" uniqKey="Ondov B">B.D. Ondov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ostrovsky, R" uniqKey="Ostrovsky R">R. Ostrovsky</name>
</author>
<author>
<name sortKey="Rabani, Y" uniqKey="Rabani Y">Y. Rabani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raff, E" uniqKey="Raff E">E. Raff</name>
</author>
<author>
<name sortKey="Nicholas, C" uniqKey="Nicholas C">C. Nicholas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shrivastava, A" uniqKey="Shrivastava A">A. Shrivastava</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, W" uniqKey="Wu W">W. Wu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, M" uniqKey="Zhao M">M. Zhao</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Pennsylvanie</li>
</region>
<settlement>
<li>Pittsburgh</li>
</settlement>
<orgName>
<li>Université Carnegie-Mellon</li>
</orgName>
</list>
<tree>
<country name="États-Unis">
<region name="Pennsylvanie">
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
</region>
<name sortKey="Deblasio, Dan" sort="Deblasio, Dan" uniqKey="Deblasio D" first="Dan" last="Deblasio">Dan Deblasio</name>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<name sortKey="Pandey, Prashant" sort="Pandey, Prashant" uniqKey="Pandey P" first="Prashant" last="Pandey">Prashant Pandey</name>
</country>
</tree>
</affiliations>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000500 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000500 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     PMC:6612865
   |texte=   Locality-sensitive hashing for the edit distance
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i   -Sk "pubmed:31510667" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021